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DIRECT  NUMERICAL  SOLUTION  OF  LARGE  SETS 
OF  SIMULTANEOUS  EQUATIONS 

H.A.  KAMEL  AND  M.W.  MCCABE 

Aerospace  and  Mechanical  Engineering  Department 
University  of  Arizona,  Tucson,  AZ  85721,  U.S.A. 

Abstract  — The  paper  attempts  to  survey,  classify  and  compare  methods 
availab.  r the  solution  of  simultaneous  equations  on  a digital 
computer.  It  is  known  that,  for  direct  methods,  the  amount  of  central 
processor  time  required  to  perform  the  solution  varies  little  from  one 
method  to  the  other.  More  interesting  than  the  algorithm  is  the  data 
handling  method.  The  basis  for  comparison  here  is  core  utilization  and 
the  number  of  required  I/O  operations.  Low  core  utilization  means 
higher  program  capacity  and  increased  generality,  whereas  a low  I/O 
activity  Improves  efficiency  and  reduces  system  residence  time.  One  of 
the  Interesting  findings  in  a comparative  study  of  the  demands  on  computer 
resources  is  that  the  data  handling  method  is  indeed  the  deciding  factor, 

rather  than  the  mathematical  algorithm.  ^ 

In  order  to  avoid  too  mathematical  an  approach,  the  paper  recognizes 
the  origin  of  the  equations  in  the  mathematical  modelling  of  physical 
problems.  A classification  of  such  problems  is  attempted  and  its 
implications  with  regard  to  the  solution  procedures  are  assessed  whenever 
possible.  A number  of  parameters  are  suggested  for  use  in  program  per- 


formance measurements 
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I.  SYALQATIW  9F  39UHWH  HTOOTg 

Numerical  solutions  of  simultaneous  equations  may  be  performed  vith 

either  direct  or  Iterative  methods. ^ In  some  cases,  hybrid  methods, 
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combining  an  element  of  both,  may  be  employed.  * Iterative  methods 
have  the  advantages  of  a low  storage  requirement  and  possible  fast  con- 
vergence In  certain  cases,  but  their  rate  of  convergence  Is  unpredict- 
able. The  paper  concerns  Itself,  therefore,  primarily  with  direct  methods. 

Many  misleading  statements  and  conclusions  can  be  found  in  the  cur- 
rent literature  due  to  a lack  of  distinction  between  a basic  mathematical 
technique  and  its  Implementation  on  an  electronic  digital  computer.  Just 
as  an  excellent  mathematician  may  have  his  "tricks  of  the  trade,"  so  does 
a good  professional  programmer  have  access  to  special  knowledge  result- 
ing in  efficient  implementation  of  a basic  mathematical  algorithm.  The 
performance  of  a particular  implementation  is  Influenced  not  only  by 
the  excellence  of  the  programmer,  but  by  the  available  system  software 
and  hardware  at  a particular  installation.  The  presence  of  a large  core, 
direct  access  (or  index  sequential)  files,  high  speed  backing  storage, 

t 

efficient  FORTRAN  compiler  and  file  management  system,  etc.,  has  a great 

Impact  on  the  capacity  and  performance  of  a program  and  will  undoubtedly 

...  . I 

Influence  the  choice  of  the  algorithm.  In  view  of  this,  a classification 

of  data  handling  techniques  and  an  evaluation  of  their  demands  on  computer 

resources  are  attempted.  In  comparing  these  algorithms,  criterion  are  j 

selected  that  are  system  Independent,  as  much  as  possible.  The  findings 

are  then  applied  to  two  particular  installations. 
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It  Is  customary  to  refer  to  a solution  method  (mathematical  algorithm) 

by  the  name  of  an  originator  to  whom  the  method  Is  commonly  attributed 

(Gatiss,  Cholesky,  Crout,  etc.).  Instead  of  this,  two  main  solution 

strategies  are  defined,  the  trlangularizatlon  methods  (e.g.,  Gaussian 

elimination)  and  factorization  methods  (e.g.,  Crout,  Cholesky,  Inverse 
[4] 

decomposition  J).  Although  It  can  be  shown  that  the  factorization  and 

trlangularizatlon  methods  are  closely  related  in  a mathematical  sense^*5*^, 
the  computational  steps  and  sequence  of  data  retrieval.  In  a computer 

program,  are  distinctly  different  and  therefore  play  an  important  part  In 

any  comparison.  It  can  be  shown  that  the  greatest  difference  relates  to 

the  I/O  operations  rather  than  the  arithmetic  effort  required. 

II. 1 Trianaularization  Method  (Gaussiem  Elimination) 

Consider  the  equation 

K r - R (1) 

where  K is  a square  non-singular  coefficient  matrix,  R is  a right- 
hand  side  and  £ is  an  unknown  vector(s).  A series  of  row  transformations 
are  performed,  resulting  in  an  upper  triangular  coefficient  matrix.  The 
process  may  be  represented  mathematically  as 

(L  K)  r - (L  R)  (2) 

where  L is  a lower  triangular  matrix. 

Once  this  form  is  achieved,  the  solution  proceeds  through  a simple 
back-substitution  operation.  It  is  also  possible  to  perform  the  tri- 
angularlzatlon  in  reverse  order,  using  an  upper  triangular  matrix, 

(UK)  r - (U  R)  (2a) 
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followed  by  a forward  substitution 


II. 2 Factorization  Method  (Crout,  Cholesky,  Inverse  decomposition) 

In  this  method,  the  coefficient  matrix  is  factorized  (decomposed) 

Into  a number  of  matrices.  The  most  general  case  is  given  by 

K - L d U (3) 

where  L is  a lower  triangular  matrix  with  unit  diagonal  elements,  U is 
an  upper  triangular  matrix  with  unit  diagonal  elements  and  d is  a diagonal 
matrix.  It  is  also  possible  to  reverse  the  scheme  by  expressing  the  co- 
efficient matrix  as 

K - U d L . (3a) 

Whether  equations  (3)  or  (3a)  are  utilized,  the  program  proceeds  to 
compute  L,  and  U and' store  them.  Using  equation  (3),  equation  (1)  now 
takes  the  form 


L d U r - R. 


(4) 


The  solution  to  equation  (4)  is  given  by: 


r 


D"1d"1L"1R. 


(5) 


It  is  not  necessary,  or  advlsabla,  to  form  the  inverse  of  any  of  these 
matrices  explicitly  since  matrix  sparsity  is  thereby  destroyed.  One 
proceeds  to  evaluate  the  following  vectors  directly,  and  in  this  order. 


and  finally 


U 


(5a) 

(5b) 

(5c) 


In  the  Crout  method,  the  matrix  U is  multiplied  by  d to  form 


K 


such  that 


K - L U* 


Special  cases  arise.  In  the  case  of  structural  problems,  for 


example,  K is  a symaetrlc  positive  definite  matrix.  In  this  case. 


L - Ut. 


And  it  Is  only  necessary  to  compute  and  store  one  of  the  matrices  L 


and  D. 


In  the  Cholesky  decomposition,  the  square  root  of  the  diagonal 


matrix  <1  Is  obtained  and  multiplied  by  L so  that 


* 

K ■ L L 


where 


L*  - d*L 


In  both  the  Crout  and  Cholesky  methods  an  attempt  is  made  to 


eliminate  the  matrix  d from  the  formulation.  In  both  cases  the  variation 


on  the  basic  factorization  scheme  is  minor  and  causes  additional  problems 


Neither  method  results  In  an  appreciable  saving  of  storage  space  or 


seriously  affects  the  computing  effort.  Therefore,  the  original  for- 


mulation of  equations  (4)  and  (5)  will  be  used  for  non-symmetrlc  matrices 


In  additon,  equation  (7)  will  be  used  for  symmetric  matrices. 


For  both  basic  solution  procedures,  the  triangular Izatlon  and 


factorization  (decomposition),  one  can  safely  state  the  following: 


a.  It  is  possible  to  solve  for  several  right-hand  sides  simultaneously. 


b.  No  additional  storage  is  required  for  the  L,  U and  d matrices. 
They  can  occupy  the  same  storage  which  K occupied  originally. 


c.  By  saving  the  triangular  matrices  < L K) , d,  L,  and/or  U,  it  is 
possible  to  solve  for  new  sets  of  right-hand  sides  without 


■ 
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further  trlangulerlzatlon  or  factorlzat^wu. 

d.  Both  methods  may  be  combined  with  any  of  the  data  storage 
schemes,  to  be  described  under  IV.  In  particular,  both  methods, 
and  their  derivatives,  can  be  generalized  to  the  hyper-row  or 
hyper-matrix  schemes. 

e.  Both  methods  require  the  same  computational  effort.  See,  for 
example,  the  operation  count  In  ref.  [4]. 

It  Is,  therefore,  natural  to  conclude  that  the  relative  superiority 
of  one  scheme  over  the  other  can  only  be  determined.  If  at  all,  on  the 
basis  of  their  demand  on  computer  resources  — mainly  core  space  and  I/O 
operations.  The  choice  becomes,  therefore,  one  of  the  data  handling 
strategy. 

III.  ITERATIVE  METHODS 

Iterative  methods  are  potentially  useful  in  the  solution  of  large 
systems  of  equations.  Although  few  programs  utilize  such  methods,  it 
may  be  unwise  to  dismiss  iterative  methods  altogether.  Such  methods 
still  hold  certain  advantages,  such  as  low  storage  requirements,  which 
may  be  Important  In  conjunction  with  large  three  dimensional  solids 
problems,  where  storage  Is  of  prime  importance  — particularly  if  the 
current  trend  of  Increasingly  faster  central  processors,  coupled  with  a 
relatively  stagnant  backing  storage  technology,  continues.  Although  the 
paper  is  primarily  concerned  with  direct  methods,  we  list  the  following 
Iterative  approaches  which  are  most  prominent  today: 

a.  Gauss-Seldel  Method 

b.  Block  Relaxation  Methods  [2,3] 

c.  Conjugate  Gradient  Method  [7,8] 

d.  Dynamic  Relaxation  Method  [9] 


IV.  DATA  HANDLING  METHODS 


Core  storage  is  usually  not  sufficient  to  hold  both  the  co- 
efficient matrix  and  the  right-hand  side.  Special  data  handling 
techniques  are  utilized  with  two  objectives  in  mind:  low  core 
requirement  and  few  I/O  operations.  The  main  schemes  are  described 
under  this  section.  They  are  labelled  using  I for  in  core,  R for 
row,  G for  group  or  partition,  S for  submatrix.  Except  for  IB,  a two 
letter  designation  stands  for  the  disk  storage  unit  (record) , and  the 
computational  unit  (pivot),  respectively,  as  described  below. 

IV. 1 In-Core  Banded  Storage  (IB) 

Only  useful  in  small  size  problems.  In  non-linear  problems,  where 
problem  size  is  often  kept  down  out  of  necessity,  and  where  one  attempts 
to  avoid  the  use  of  mass  storage  devices,  the  scheme  is  still  heavily 
used. 

IV. 2 Banded  Out-of-Core  Storage,  Rows  (or  Columns)  Stored  Individually  (RR) 

This  scheme  requires  only  enough  core  storage  to  accommodate  two 
individual  rows  of  the  matrix  simultaneously.  The  solution  is  accomplished 
via  single  row  combinations.  It  has  the  disadvantage  of  excessive  I/O 
activity. 

IV. 3 Banded  Out-of-Core  Storage,  bg  Row  (or  Column)  Partitions  (GR.GG) 

In  which  the  banded  matrix  is  stored  in  row,  or  column,  partitions. 

Each  partition  contains  a number  of  rows,  or  columns,  and  is  read,  or 
written,  with  one  I/O  instruction.  This  method  of  operation  may  be 
combined  with  row  by  row,  or  partition  by  partition,  solution  algorithms. ^ 10 ^ 


IV. 4 Sparse  Representation  [11,12] 


In  this  method,  the  sparsity  of  the  matrix  Is  taken  into  account 
on  an  element,  or  element  string,  basis.  Each  element,  or  element  string. 
Is  accompanied  by  a header  describing  Its  size  and  position  within  the 
matrix.  An  effort  is  made  so  that  only  "active"  strings  are  present  In 
core  at  any  stage  of  the  computation.  It  is  difficult  to  evalute  such  an 
approach,  due  to  Its  problem  dependency.  We  feel,  however,  that  the 
method  can  essentially  be  described  In  terms  of  one  of  the  other  schemes, 
with  variable  band-width.  The  problems  chosen  as  a basis  for  comparison 
In  this  paper  do  not  reflect  this  property,  and  we  can  therefore  exclude 
sparse  representation  from  our  comparison. 

IV. 5 Hypermatxlx  Method  ( SG ) [13,14,15 ,16 ,17] 

The  coefficient  matrix,  as  well  as  the  right-hand  side,  are  par- 
titioned Into  compatible  blocks  called  submatrices.  Each  submatrix  is 
treated  as  a record  to  be  moved  In  or  out  of  core  with  a single  I/O 
operation.  The  Individual  submatrices  are  not  stored,  or  operated  upon, 
if  zero.  Therefore,  a number  of  pointers  are  utilized  to  indicate  the 
existence,  or  non-existence,  of  the  submatrices,  and  their  disk  addresses. 
These  pointers  may  be.  In  themselves,  regarded  as  a matrix.  For  large 
problems,  this  matrix  again  presents  a problem  and  must  be  handled  using 
one  of  the  schemes  described  under  this  section.  It  is  possible  to  treat 
it  as  a banded  matrix,  a sparsely  populated  matrix^^’^^ 

»trlx[15>. 


or  as  a hyper- 


IV. 6 Miscellaneous  Methods 


The  above  cases  have  been  chosen  as  well-defined  examples  of  data 
handling  methods.  Instances  may  exist  where  programs  use  variations  on 
the  above  methods  that  make  them  perhaps  more  efficient  In  certain 

situations,  but  also  difficult  to  classify.  As  an  example,  the  Wav. 

[18] 

Front  technique1  J utilizes  the  sparse  method  of  representation  coupled 
with  a sophisticated  disk  handling  scheme.  The  method  combines  a banded 
approach  with  a form  of  band-width  optimization  performed  simultaneously 
with  the  coefficient  matrix  assembly. 

Buffering  may  sometimes  be  useful,  where  an  I/O  buffer  is  created, 
which  contains  a number  of  records  (row,  columns  or  hypermatrices).  If 
an  I/O  operation  Is  Initiated,  the  buffer  directory  is  first  consulted 
before  an  I/O  access  is  attempted.  Exchanges  between  the  buffer  and  the 
disk  are  governed  by  a suitable  paging  algorithm.  This  technique  increases 
the  I/O  overhead  but  may  be  of  great  benefit,  particularly  if  an  additional 
amount  of  fast  secondary  core  (e.g.  Extended  Core)  is  available.  A 
similar  situation  arises  with  some  mini-computers  in  which  a FORTRAN  pro- 
gram may  only  access  a limited  amount  of  core  during  computation,  but 
may  use  addresses  beyond  its  area  for  block  data  transfers. 

V.  CLASSIFICATION  OF  MATHEMATICAL  MODELS 

The  type  of  mathematical  model  employed  has  a direct  impact  on  the 
correct  choice  of  solution  algorithm.  Ideally,  general  purpose  programs 
should  employ  the  most  efficient  all  round  algorithm  possible.  Many 
special  purpose  programs,  however,  employ  the  simplest  possible  algorithm 
to  solve  the  class  of  problems  they  are  designed  for.  We  believe, 
therefore,  that  the  classification  attempted  here  is  of  value  both  to 


special  program  designers,  and  as  a measure  of  performance  of  algorithms 
employed  in  large  general  purpose  programs.  For  the  purpose  of  our  study. 


we  choose  some  typical  grids  employed  in  a discrete  mathematical  model, 
as  typified  by  finite  element  and  finite  difference  approaches.  Figure  1 
shows  sketches  of  the  four  models  chosen.  In  each  case,  the  number  of 
grid  nodes  in  any  one  direction  is  given  by  N,  and  the  number  of  degrees 
of  freedom  by  f.  In  a heat  transfer  problem,  for  example,  f - 1.  It  is 
equal  to  3 in  elastic  solids  problems  and  6 in  shell  problems.  The  four 
models  chosen  are: 


V.l  One-Dimensional  Grids  (E1S1 


This  is  the  simplest  case,  admittedly  trivial,  utilized  in  the 
solution  of  a one-dimensional  problem.  Models  of  this  nature  are  best 
handled  in  core  using  a banded  storage  scheme. 


?- Dimensional  Grids  (E2S2) 


The  demands  on  computer  resources  are  moderate.  A banded  approach 
is  effective.  In-core  handling  is  possible  for  small  problems.  Out-of- 
core banded  schemes  (GR,GG),  such  as  in  the  SAP  program^*^  perform  well 
in  this  situation. 

V.3  Two-Dimensional  Grids  in  Three-Dimensional  Space  (E2S3) 

Such  models  are  typical  of  shell  structures,  and  are  of  great 
practical  importance.  The  chosen  example  grid  is  sufficiently  repre- 
sentative. In  practice,  such  grids  may  be  of  a highly  complex  nature. 

The  demands  on  computer  resources  are  greatly  Increased  and,  as  will 
be  shown  later,  more  sophisticated  schemes  such  as  those  employed  in  large 
scale  programs  [11,  13,  14,  15,  16,  17]  are  Imperative.  The  complexity 
of  node  interaction  dictates  a sophisticated  method  for  the  handling  of 


I 


Surface  Element 


Figure  1 . Classification  of  Model  Grids 


matrix  sparsity. 

V.4  Solid  Continuum  Problem  (E3S3) 

This  Is  the  most  demanding  of  all  models.  The  sheer  amount  of 
storage  and  computation  required  places  most  crucial  demands  on  computing 
systems.  Not  only  are  advanced  data  handling  methods  required,  but  also 
new  methods  and  techniques  should  be  constantly  Investigated.  Iterative 

[7  gi  f2] 

solution  methods1  ' J and  hybrid  approaches  1 J are  possible  candidates. 

VI.  PROBLEM  SIZE  LIMITATION  DDE  TO  AVAILABLE  SPACE 

In  order  to  evaluate  the  effect  of  computer  resources  on  program 
capacity,  storage  space  Is  first  considered.  Table  1 shows  the  limitations 
for  a number  of  model  types  and  different  numbers  of  degrees  of  freedom  — 
If  the  solution  Is  conducted  In  core,  with  40,000  words  available  core 
storage.  Both  sparse  representation,  used  in  conjunction  with  an  Iterative 
solution  method  requiring  no  factorization  of  decomposition,  and  direct 
methods  are  considered.  No  account  is  taken  of  the  storage  overhead 
necessary  to  organize  the  data. 

r 

It  Is  assumed  that  the  total  storage  requirement  remains  practically 
the  same  regardless  of  the  data  handling  method  used. 

Table  2 presents  similar  figures  for  an  out-of-core  solution,  assum- 
ing one  million  words  of  disk  space  are  reserved  for  the  stiffness  matrix. 

In  both  cases,  the  high  demand  on  computer  storage  Is  evident  — 
particularly  in  the  case  of  the  B2S3  models.  One  may  notice,  for 
example,  that  with  an  expansion  of  storage  allocation  from  40,000  to 
1,000,000  words,  the  maxluum  E3S3  grid  size,  for  a heat  transfer  problem, 
increases  only  from  a (7x7x7)  to  a (14x14x14);  l.e.  double  the  grid  re- 
solution for  a 25  fold  Increase  In  storage  space. 
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The  following  is  a key  to  the  Information  contained  In  Tables  1 
and  2: 

d - (subscript)  refers  to  direct  method, 

1 - (subscript)  refers  to  Iterative  method, 

Nd,Ni  " number  nodes  per  dimension  (grid  density), 

f ■ number  of  degrees  of  freedom/node, 

U, ,U.  ■ number  of  unknowns, 
a l 

b ■ maximum  half-band-width, 

S^,S^  - storage  requirement  for  direct  and  Iterative  solutions, 

N{J,Ni  ■ program  capacity  for  direct  and  iterative  solutions.  In 
terms  of  grid  density  resolution,  and 

a ■ matrix  sparsity  factor  (see  below) . 

VII.  .GENERALIZATION  QF  THE  BAND-WIDTH  CONCEPT 

The  concept  of  the  "band -width",  B,  and  half -band -width,  b,  have 
been  extremely  useful  In  program  design.  In  section  VI,  the  concept  was 
used  as  a basis  for  evaluating  the  relationship  between  storage  space 
available  and  maximum  problem  size  for  a chosen  set  of  highly  regular 
grids,  representative  of  typical  classes  of  physical  problems.  Schemes 
based  on  sparse  representation,  however,  such  as  described  under  IV. 4 
or  the  block-oriented  hypermatrix  scheme  described  under  IV. 5,  do  not 
operate  on  this  principle.  Such  schemes  take  into  consideration  only  those 
elements  different  from  zero,  in  both  storage  and  computation.  It  is 
necessary  to  define,  in  more  precise  terms,  a new  set  of  parameters  to 
describe  the  sparsity  of  an  arbitrary  matrix.  It  is  hoped  that  these 
new  parameters  will  play  a role  In  measuring  the  performance  of  solution 
schemes,  as  well  as  In  the  prediction  of  solution  times,  and  node  and 
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element  renumbering,  for  the  purpose  of  solution  cost  minimisation. 


This  Is  defined  as  the  total  number  of  non-zero  elements  in  a 
particular  row  or  column.  It  Is  necessary  to  distinguish  between  the 
active  local  bandwidth  before  and  after  decomposition  (or  factorization) . 
The  former,  which  is  smaller,  is  Important  in  iterative  schemes,  while 
the  latter  determines  the  computational  effort  In  a direct  scheme.  Sym- 
metric coefficient  matrices  will  now  be  considered,  although  a general- 
ization to  non-symmetric  matrices  Is  straightforward.  We  Introduce  the 
following  symbols: 


B Initial  active  local  bandwidth  for  row  (or  column)  1. 

1 

b 

1 Initial  active  local  half  bandwidth  for  row  1.  b 
is  a count  of  the  non-zero  entries  horizontally  °1 
to  the  right,  starting  with  the  diagonal  elements. 

cq  Initial  active  local  half  bandwidth  for  column.  1.  c 

1 is  measured  vertically  up,  starting  with  the  °i 

diagonal. 

Bj,  Active  local  bandwidth  for  row  (column)  1.  It  is  equivalent 
to  B for  the  coefficient  matrix  after  factorization  or 
i 

decomposition. 

bi  & C1  Active  local  half  bandwidtha  for  row  and  column  i,  re- 
spectively, after  element  propagation  due  to  a direct 
solution. 


The  reader  should  notice  that 


B - 

b 

+ c 

- 1 

°i 

°i 

°i 

and 

B - 

b. 

+ c , 

- I 

1 

1 

i 

(9) 

(9a) 


latnx  Prori „ 


This  is  defined  as  an  array  containing  all  local  bandwidth  values. 

It  describes  the  matrix  sparsity  in  a more  accurate  fashion  than  the 
usual  (maximum)  bandwidth  concept. 

B-{B  B B B } (10) 

-o  °1  °2  °1  °U 


is  the  Initial  matrix  profile,  and 

I'  (BjBj 


B„  ) (11) 


is  the  (final)  matrix  profile,  after  element  propagation. 

Similarly  for  initial  row  and  column  (half)  bandwidth  profiles  we 


b ■ { b b ...b  ...b  } 

-o  ox  o2  ot  Qu 


C • { c c 

-°  °x  o2 


• C • • • C } 

o • o 

1 u 


!' 


and  for  final  profiles  we  define 


b - (bx  b2  . . 

C - { Cj  c2  . . 


bi  . . .by} 


Cl  • . • ^ ) 


VII. 3 Generalized  Matrix  Parameters 


The  following  parameters  are  introduced.  They  represent  useful 
scalar  measures  of  sparse  matrix  properties  utilized  to  measure  space 
and  computational  requirements.  Again  only  symmetric  matrices  are  con- 
sidered, although  the  generalization  to  non-aymmetric  matrices  is  straight- 
forward. 


VII .3.1  Mean  Bandwidth 


The  mean  bandwidth  of  a matrix  is  defined  as: 


( 

i 
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(16) 


0 

b - (I  bt)/U. 

1 - 1 

This  parameter  Is  utilised  primarily  to  measure  storage  requirements. 


VII. 3. 2 Matrix  Sparsity  Factor 


o varies  from 
matrix. 


a - 2b/ (0+1)  * 2b /U 
2 


(17) 


(U+l) 


for  a diagonal  matrix  to  1 for  a fully  populated 


VTI.3.3  Computational  Bandwidth 


The  number  of  floating  point  multiplications  and  additions  Involved 
In  the  triangularlzatlon  of  a matrix  Is  approximately  equal  to  b2U/2,  for 
a matrix  with  a constant  local  bandwidth.  This  leads  to  a bandwidth 
parameter,  called  the  computational  half  bandwidth,  which  may  be  used  for 
solution  time  estimates. 

0 U 

b - [E  b|/ur  (18) 

1 - 1 

VIII.  MEASUREMENT  OF  PROGRAM  PERFORMANCE 

Solution  routines,  based  on  a basic  mathematical  algorithm,  are 
difficult  to  compare.  The  following  performance  measurement  parameters 
are  suggested  to  help  comparison  and  evaluations. 

VIII. i storage  Efficiency  Parameters  (E  ) 

The  minimum  storage  required  for  a coefficient  matrix  is  given  by: 


Smln  “ b U* 


(19) 


However,  due  to  program  organization,  additional  space  may  be 
needed  to  structure  the  data.  If  the  total  amount  of  space  required  to 
store  and  manage  the  coefficient  matrix  Is  given  by  S,  the  storage 
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efficiency  of  the  program  Is  defined  by: 


’ 

■ 


» 


i 

' 


i 


F . S|nln 

Es  "s“ 


VIII. 2 Computational  Efficiency  (Ej 

The  total  amount  of  operations  needed  by  the  Gaussian  elimination  Is 
approximately  equal  to 

U bj 

(^  - r ( + 2BiP)  (20) 

1 - 1 

where  p Is  the  number  of  right-hand  sides.  However,  considerable  over- 
head Is  usually  associated  with  data  management.  One  way  to  measure 
efficiency  is  to  compare  the  central  processor  time,  T,  taken  by  the 
program  to  solve  the  equations  with  the  approximate  time  for  the  Gaussian 
elimination  T^,  defined  as 


T8  • % (tM  + V' 


(21) 


where  t^  and  tA  are  the  CP  times  required  to  perform  a floating  point 
multiplication  and  an  addition,  respectively,  on  the  given  computer, 
program  computation  efficiency,  E^,  is  defined  as 

. V (t«  * V 

c T 

IX.  COMPUTER  RESOURCE  REQUIREMENTS 


The 


(22) 


A solution  scheme  has  certain  demands  on  the  computer.  In  measuring 
the  total  burden  on  the  system  one  has  to  consider  the  utilization  of  all 
its  resources.  In  detail  they  are: 

a.  Central  processor  time  to  perform  the  necessary  computations. 
Overhead  is  also  present  due  to  data  management  functions  that 
the  program  has  to  perform. 

b.  Core  space  required  to  store  program  and  data.  In  an  out-of-core 
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solution,  a part  of  this  space  Is  used  as  a dynamic  working 
space  containing,  at  any  one  time,  only  a fraction  of  the  data 
required  for  the  solution  of  the  problem.  Core  space  is  an 
important  and  limited  resource.  This  fact  is  reflected  in  most 
charge  algorithms,  which  use  the  amount  of  core  utilized  by 
a program  as  weighting  factor  in  assessing  system  charges. 

c.  Input/Output  (1/0)  time.  Ignoring  hard  to  measure  factors  such 
as  system  I/O  buffering  and  virtual  memory  machine  operation, 
each  explicit  request  by  the  program  to  the  system  to  either 
read  or  write  a block  of  data  from  an  I/O  device,  typically  a 
disk,  constitutes  a demand  on  system  resources.  The  system  I/O 
charges  are  based  on  both  the  initial  disk  access  (es)  and  the 
amount  of  data  transferred.  Both  elements  will  be  used  as 
criterion  In  evaluating  the  I/O  requirements  of  a particular 
algorithm.  It  is  interesting  to  note  that  in  large  computer  sys- 
tems I/O  operations  require  extensive  amounts  of  associated  CP. 

d.  Disk  Space  Requirements.  Finally,  a certain  amount  of  disk 
space  is  required  to  store  primary  problem  data  as  well  as 
organizational  data  used  to  manage  it.  Disk  require- 
ments may  or  may  not  be  a factor  in  assessing  computer  run 
charges.  If  the  data  is  retained  on  disk  permanent  files  after 
job  completion,  disk  storage  is  charged. 


IX. 1 Comparing  Resource  Requirements  for  Different  Solution  Schemes 

In  order  to  determine  the  suitability  of  the  various  schemes  for 
the  solution  of  typical  model  types,  we  have  to  assess  the  total  effect 
due  to  all  four  factors  listed  above.  Studies  by  the  authors  and  others 
(see  for  example  [9]),  show  that  central  processor  time  utilized  in  the 
basic  solution  steps  is  more  or  less  invariable  — regardless  of  the 
mathematical  tool  and  data  handling  scheme  used.  It  can  be  also  shown 
that  disk  space  requirement  is  practically  constant.  It  follows  that  the 
two  most  important  factors  are  those  of  core  storage  and  the  I/O  time. 

The  purpose  of  this  section  is  to  study  the  amount  of  both  resources 
and  their  dependence  on  the  model  type  and  the  data  handling  scheme.  Only 
the  fundamental  data  schemes  will  be  considered.  The  following  gives  a 
brief  description  of  each  method.  Each  scheme,  except  for  IB,  is  designated 
by  a two  letter  code.  The  first  denotes  the  storage  unit  which  may  be  a 


row  (R) , a group  of  rows  (G)  or  a sub-matrix  (S) . The  second  denotes 
the  solution  (pivoting)  unit.  The  method  may  solve  row  by  row  or  group 
by  group. 


a.  In-core  banded  solution  (IB). 

b.  Out-of-core  banded  solution  (RR) , based  on  active  bands.  Each 
single  row  is  stored  as  an  independent  record  and  may  be  read 
or  written  with  a single  I/O  instruction.  Pivoting  is  by  rows. 

c.  Out-of-core  banded  solution.  Rows  stored  in  blocks.  Solution 
proceeds  row  by  row  (GR) . For  a particular  problem,  the  more 
rows  there  are  in  a group,  the  more  core  is  needed  and  the  less 
the  number  of  I/O  operations.  In  balancing  the  two  factors, 
program  performance  may  be  optimized  as  shown  in  section  VIII. 

d.  Out-of-core  banded  solution.  Rows  stored  in  blocks.  Pivoting 
is  by  groups  (GG) . This  method  is  the  most  economical  in  I/O 
operations  (see  Table  3) , but  requires  excessive  core  storage 
for  large  problems. 

e.  Out-of-core  hypermatrix  solution  (SG)  in  which  the  grouping  of 
unknowns  is  done  on  both  rows  and  columns  in  a compatible  manner. 

The  coefficient  matrix  is  divided  into  sub-matrices.  Each  sub- 
matrix is  read  into  core  or  written  on  disk  with  one  I/O 
instruction.  This  method  is  the  most  economical  in  core  storage 
utilization  for  large  problems. 

The  core  storage  requirement  for  the  hypermatrlx  is  that  sufficient 
to  hold  four  or  five  sub-matrices,  as  well  as  pointers  to  a number  of 
related  sub-matrices.  In  the  simplest  possible  of  schemes  the  full 
pointer  matrix  is  stored  in  core.  This  approach,  however,  severely,  limits 
problem  size.  In  order  to  reduce  the  amount  of  pointer  storage,  any  of 
the  approaches  applied  to  the  coefficient  matrix  Itself  may  again  be 
utilized.  In  particular,  the  following  possibilities  exist: 

(i)  Pointer  matrix  stored  in  core,  as  a full  matrix  (SG-IF) . 

(ii)  Pointer  matrix  stored  in  core,  as  a banded  (sparse)  matrix  (SG-IB) . 

Pointer  matrix  stored  out  of  core,  as  a banded  matrix,  only  the 
relevant  rows  are  held  in  core  (SG-OB) , We  will  only  consider 
the  case  where  each  record  stores  the  addresses  to  a row  of  sub- 
matrices,  in  a sparse  format.  All  active  address  rows  are  pre- 


(iii) 


sent  in  core 


(lv)  Pointer  matrix  stored  out  of  core,  as  a hypermatrix  (SG-H) . 

The  SG-OB  method  for  storage  of  pointer  matrix  rows  will  be  considered 
in  assessing  core  storage.  It  will  be  assumed  that  the  additional  I/O 
operations  to  handle  pointers  have  a negligible  effect.  It  Is  also  assumed 
that,  for  the  hypermatrix  scheme,  the  right-hand  side  will  be  stored  In- 
dependently of  the  coefficient  matrix,  in  compatible  blocks.  It  Is  possible, 
therefore,  to  solve  for  many,  right-hand  sides  simultaneously.  If  the 
number  of  right-hand  sides  exceeds  the  number  of  rows  allowable  in  a 
partition,  the  right-hand  side  matrix  can  be  partitioned  column-wise  also. 
This  case  will  not  be  considered  here. 

IX. 2 Comparison  of  Computer  Resource  Requirements 

Table  3 shows  a comparison  between  the  various  solution  algorithm/ 
data  handling  combinations  and  their  core  and  I/O  requirements.  A banded 
matrix  is  assumed.  In  Interpreting  the  given  data,  however,  one  must 
realize  that  a well  written  sparse  scheme  will  essentially  provide  the 
same  results,  with  the  active  band  width  taking  the  place  of  the  maxi- 
mum band  width.  Each  method  is  designated  with  the  letter  identifier 
used  In  the  last  section  (IB,  HR,  GR,  GG,  SG) . In  addition,  the  letter 
T will  denote  the  trlangularlzatlon  (Gaussian  Elimination)  scheme  and 
the  letter  F the  factorization  (Cholesky,  Crout,  etc.). 

In  the  banded,  row  oriented  solution  schemes,  it  is  assumed  that 
the  right-hand  alde(s)  are  stored  with  the  corresponding  rows.  In  the 
hypermatrix  scheme  the  right-hand  sides  are  stored  In  separate  blocks. 

The  following  notation  is  employed: 

r - No.  of  rows  in  a block,  or  submatrix  size. 
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9 * No.  of  non-zero  submatrices  in  a hyperrow  = b/r 
G - Total  no.  of  partitions  = U/r. 


t: 

i 


j. 

i 


\ 


Data  Handling  Method 

Core  Requirement 

Number  of  I/O  Calls 

Record  Size 

IB-T 

IB-F 

Ub 

0 

- 

RR-T 

2b 

4bU 

b 

RR-F 

2b 

5bU 

b 

GR-T 

b(r+l) 

4sU 

br 

GR-F 

b(r+l) 

5sU 

br 

GG-T 

2br 

4sG 

br 

GG-F 

2br 

5sG 

br 

SG-T 

4r2  + s2 

r2 

SG-F 

5r2  + s2 

r2 

TABLE  3.  Comparison  of  Core  Requirements  and  I/O 
Operations  for  Data  Handling  Schemes 


IX . 3 Examples 

Figures  2,  3 and  4 describe  program  requirements  during  the 
solution  of  a two-dimensional  model  (E2S2)  with  two  degrees  of  freedom 
per  node.  In  solution  schemes  based  on  row  and/or  column  grouping, 
blocks  of  30  rows  each  are  assumed.  Fig.  2 shows  core  requirements 
for  the  various  methods  as  a function  of  problem  size.  The  group 
storage,  group  reduction  scheme,  GG,  shows  an  advantage  over  the 
hypermatrix  scheme,  for  models  up  to  N-28  (U=1600) . Figures  3 and  4 
shows  the  number  of  I/O  calls,  and  demonstrates  definite  superiority 
of  GG  and  SG  over  the  GR  and  RR  schemes.  Figure  4 shows  a slight 
advantage  for  the  SG  over  the  GG  method. 
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Figures  5,  6,  and  7 show  the  core  and  I/O  requirements  for  a two- 
dimensional  problem  with  3200  unknowns  and  a half  bandwidth  of  84  as  a 


function  of  partition  size.  As  seen  from  figure  4,  the  hypermatrix 
scheme  uses  somewhat  less  core  than  GG  for  partition  sizes  of  up  to  35 
or  40.  GG  and  SG  require  less  I/O  than  GR  throughout,  with  SG  again 
holding  a definite  edge. 

It  appears,  therefore,  from  figures  2 to  7 that  both  the  GG  and  SG 
schemes  are  superior  in  handling  cf  two-dimensional  problems. 

Moving  to  shell-type  problems  (E2S3) , figure  8 illustrates  core 
requirements  as  a function  of  partition  size.  It  is  quite  apparent  that, 
for  large  bandwidth  problems,  the  hypermatrix  scheme  requires  substantially 
less  core.  From  an  I/O  point  of  view  (fig.  9),  GG  is  again  inferior  to 
SG. 


The  same  conclusions  are  valid  for  solids  problems,  as  illustrated 
by  figures  10  and  11. 

X.  CHOICE  OF  OPTIMUM  PARTITON  SIZES 

In  solution  algorithms  involving  matrix  partitioning  in  one 
direction  (GR-T,  GR-F,  GG-T,  GG-F)  or  in  both  directions  (SG-T  and  SG-F) , 
large  block  sizes  result  in  efficient  disk  operation  at  the  expense  of 
core  space.  The  optimum  partition  size  depends  on  the  hardware  avail- 
able and  the  charging  algorithm  utilized  by  the  computer  center.  In 
order  to  illustrate  how  one  may  optimize  the  size  of  the  partition  for  a 
particular  system,  two  cases  are  considered.  One  case  represents  a 
typical  computer  center  environment  where  central  processor  time,  I/O 
time  and  core  requirements  are  all  taken  into  account  in  the  charge 
algorithm.  The  other  case  is  that  of  a mini-computer  in  which  residence 
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CORE  REQUIREMENTS 
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Figure 
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E2S2  f =2  N=40 


>.  Core  Requirement  for  an  E2S2  Model  aa  a Function  of  Partition  Size 

(U-3200,  b+84) 
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Figure  9.  I/O  Requirements  es.e  Function  of  Partition  Size  for  an  E2S3  Model 


I/O  CALLS 


E3S3  F=3  Nsl5 


Figure  11.  I/O  Requirements  for  e Solids  Problem 


(U-10125,  b-726) 


time  and  limited  core  are  the  only  considerations 


X. 1 Computer  Center  Environment 

A typical  charge  algorithm  in  a computer  center  environment  Is  given 
by  the  following  formula**: 

Cost  ■ B *(a  C + S H y S * L) 
where, 

R “ System  cost/hour 
C - Central  processor  time,  in  hours 

P - Peripheral  processor  (I/O)  time  in  hours.  It  is  defined  as 
the  total  time  the  program  occupies  an  I/O  channel,  and  is 
a function  of  both  the  number  of  disk  accesses  and  amount  of 
words  transferred. 

5 - Core  storage,  in  words 
L ■ Larger  of  C and  P. 

Typical  values  for  the  constants  are 

R - $500 

a - 0.6 

6 - 0.15 

Y - 1.63  x 10-5. 

Furthermore,  P is  computed  from 

P ■ p W + p.  A 
*w  ^A 

where 

W » the  number  of  words  transferred, 

and 

A * number  of  accesses. 

**Based  on  the  University  of  Arizona  Computer  Center's  CDC  6400 
system 
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For  the  Installation  under  consideration,  the  constants  are  given  by 

p ■ 1.11  x 10-8  hours/word 
rw 

p,  - 1.04  x 10-6  hours/access. 

A 

Assuming  the  program  has  a computational  efficiency  of  E^,  the  central 
processor  time  is  given  by: 


V'.  + V (t.  + ca>  ,m,2  . 

" I “ E K 2 


4bp) . 


The  peripheral  processor  (I/O)  time  for  the  various  methods  is  given  by 
Table  3.  M can  be  put  in  the  form 

M ■ Ms  + “p  + “d 

where 


M is  system  memory  requirement 

Mp  is  the  user  program  requirement,  excluding  data. 


Mp  is  the  data  requirement,  given  in  Table  3 for  the  different 
schemes . 


Case  Study  1 


Let  us  examine  an  E2-S3  model  with 


N - 20 


f - 6 


U - 4f  N(N-l)  - 9120 


b - 8f (N-l)  - 912. 

Let  us  assume  5 loading  cases.  Furthermore,  for  a CDC  6400, 
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t * 57  x 10  ^ sec. 
m 

t * 11  x 10  ^ sec. 

A 

We  assume  that,  for  a typical  program, 

E - 0.8 
c 

also 

Mg  ■ 5000  words 

Mp  ■ 12000  words. 

Figure  12  shows  the  computed  cost  for  the  given  problem,  on  the 
chosen  installation  using  both  the  GG  and  SG  schemes.  It  is  interesting 
to  note  the  sharp  minimum  exhibited  by  the  GG  method  at  about  r * 35  to 
40,  due  to  the  central  processor  time  becoming  predominant,  coupled  with 
the  high  core  requirements.  One  notes  further  that  the  required  core 
reaches  64K  at  r - 26  and  131K  for  r - 62.  On  the  other  hand,  the 
hypermatrix  scheme  (SG)  exhibits  a flat  minimum  extending  from  r * 20 
to  r - 40.  '’’he  minimum  cost  for  the  SG  is  approximate- 

ly $5,500,  compared  to  $10,300  for  GG.  These  facts  show  not  only  the 
clear  superiority  of  the  hypermatrix  scheme  for  complex  shell  problems, 
but,  assuming  the  charge  algorithm  is  reasonable  also  the  unsuitability 
of  the  computer  installation  for  handling  problems  of  this  type  and 
magnitude. 

XI,  MINI -COMPUTER  ENVIRONMENT 

In  a mini-computer  environment,  the  cost  is  a function  of  the 
residence  time  rather  than  system  time.  A typical  cost  formula**  will 

**Based  on  a PDP-15  computer,  32K  of  core,  18  bit  words. 
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Figure  12.  Effect  of  Partition  Size  on  Solution  Costs  for 

Rcw  Blocking  (GG)  and  the  Hypermatrix  Schece  (SG) 
(U-9120,  b-912) 


be  of  the  form: 


Cost  - R *(C  + B P). 

Typical  values  are 

R - $15. 00/hour 

and 

B - 8.333  E-6  hours/call 


(t  + t.)  ...  2 

C - — Sg (-^  + *bp) 


where,  for  a mini-computer  without  floating  point  processor 


—8 

t,  ■ 5.556  x 10  hours 
A 


and 

Q 

t » 5.833  x 10  hours, 
m 

With  a floating  point  processor. 


t “ 4.167E-9 
m 

tA  - 2.778E-9 
A 

P - No.  of  I/O  calls 

The  number  of  rows  per  group  is  not  determined  by  the  charge  algorithm, 
but  rather  by  the  availability  of  core.  Assuming  a core  of  32K  18  bit 
words,  and  assuming  that  the  system  occupies  6K  words,  and  the  program 
5K,  then  a total  of  21K  is  available  for  data.  Assuming  extended  accuracy 
(3  18  bit  words  per  floating  point  number)  then,  for  GG, 

21000  - 2br  x 3 - 6br. 
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XI. 1 Case  Studies 

The  mini-computer  cost  algorithm  Is  applied  to  the  same  complex 
shell  problem  used  previously,  with  9120  degrees  of  freedom  and  a half 
bandwidth  of  912,  (see  figures  13  and  14).  Since  the  amount  of  core 
available  is  limited  to  32K  of  18  bit  words,  the  curves  are  discontinued 
when  this  limit  is  reached.  It  is  obvious  that  the  cost  is  reasonable 
if  a floating  point  processor  is  available  — $600  for  SG  and  $1100  for 
GR.  The  running  times,  however,  are  40  and  70  hours,  respectively!  It 
would  appear  that  improvement  of  mini-computer  speeds  is  required  before 
problems  of  this  magnitude  can  be  tackled  in  one  computation. 


XII.  CONCLUSION 

The  paper  establishes  a systematic  approach  to  the  comparison  of 
basic  solution  techniques  and  data  handling  strategies.  New  parameters 
are  introduced,  which  may  be  used  to  describe  sparsely  populated  matrices, 
and  estimate  times  and  costs  of  solving  the  equations,  in  advance  of  the 
computation.  It  is  demonstated  how  one  can  optimize  run  costs  for  a 
particular  solution  algorithm,  a particular  problem  and  a particular 
Installation.  From  the  results  it  appears  that  both  a partitioned  banded 
approach  and  the  hypermatrix  scheme  are  suitable  for  the  solution  of  medium 
size  problems  with  a narrow  active  bandwidth.  For  large  shell  and  solids 
problems,  the  hypermatrix  scheme  seems  to  be  more  efficient.  Mini-computers 
compare  favorably  with  large  systems  in  running  costs,  but  may  require  un- 
acceptable running  times  for  large  problems. 
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